AITopics | lipschitz smoothness

Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Neural Information Processing SystemsJun-14-2026, 08:47:53 GMT

We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision-even when, the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: North America > United States > Texas (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Add feedback

We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision-even when, the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.

algorithm 1, convergence, theorem 1, (15 more...)

arXiv.org Machine Learning

2509.09802

Country:

North America > United States > Texas > Brazos County > College Station (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.30)

Add feedback

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

He, Qi, Yu, Peiran, Chen, Ziyi, Huang, Heng

arXiv.org Machine LearningJul-15-2025

Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also match the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.

artificial intelligence, lipschitz smoothness, machine learning, (18 more...)

arXiv.org Machine Learning

2507.08913

Country:

North America > United States > Texas > Tarrant County > Arlington (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis

Liu, Rui, Noorani, Erfaun, Tokekar, Pratap, Baras, John S.

arXiv.org Artificial IntelligenceMar-13-2024

Reinforcement Learning (RL) has shown exceptional performance across various applications, enabling autonomous agents to learn optimal policies through interaction with their environments. However, traditional RL frameworks often face challenges in terms of iteration complexity and robustness. Risk-sensitive RL, which balances expected return and risk, has been explored for its potential to yield probabilistically robust policies, yet its iteration complexity analysis remains underexplored. In this study, we conduct a thorough iteration complexity analysis for the risk-sensitive policy gradient method, focusing on the REINFORCE algorithm and employing the exponential utility function. We obtain an iteration complexity of $\mathcal{O}(\epsilon^{-2})$ to reach an $\epsilon$-approximate first-order stationary point (FOSP). We investigate whether risk-sensitive algorithms can achieve better iteration complexity compared to their risk-neutral counterparts. Our theoretical analysis demonstrates that risk-sensitive REINFORCE can have a reduced number of iterations required for convergence. This leads to improved iteration complexity, as employing the exponential utility does not entail additional computation per iteration. We characterize the conditions under which risk-sensitive algorithms can achieve better iteration complexity. Our simulation results also validate that risk-averse cases can converge and stabilize more quickly after approximately half of the episodes compared to their risk-neutral counterparts.

algorithm, iteration complexity, risk-sensitive reinforce, (12 more...)

arXiv.org Artificial Intelligence

2403.08955

Country:

North America > United States > Maryland > Prince George's County > College Park (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Rethinking SIGN Training: Provable Nonconvex Acceleration without First- and Second-Order Gradient Lipschitz

Sun, Tao, Chen, Congliang, Qiao, Peng, Shen, Li, Liu, Xinwang, Li, Dongsheng

arXiv.org Artificial IntelligenceOct-23-2023

Sign-based stochastic methods have gained attention due to their ability to achieve robust performance despite using only the sign information for parameter updates. However, the current convergence analysis of sign-based methods relies on the strong assumptions of first-order gradient Lipschitz and second-order gradient Lipschitz, which may not hold in practical tasks like deep neural network training that involve high non-smoothness. In this paper, we revisit sign-based methods and analyze their convergence under more realistic assumptions of first- and second-order smoothness. We first establish the convergence of the sign-based method under weak first-order Lipschitz. Motivated by the weak first-order Lipschitz, we propose a relaxed second-order condition that still allows for nonconvex acceleration in sign-based methods. Based on our theoretical results, we gain insights into the computational advantages of the recently developed LION algorithm. In distributed settings, we prove that this nonconvex acceleration persists with linear speedup in the number of nodes, when utilizing fast communication compression gossip protocols. The novelty of our theoretical results lies in that they are derived under much weaker assumptions, thereby expanding the provable applicability of sign-based algorithms to a wider range of problems.

algorithm, assumption, lipschitz, (16 more...)

arXiv.org Artificial Intelligence

2310.14616

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)
Europe > Russia (0.04)
(5 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Liu, Bo, Xie, Tengyang, Xu, Yangyang, Ghavamzadeh, Mohammad, Chow, Yinlam, Lyu, Daoming, Yoon, Daesub

arXiv.org Machine LearningSep-9-2018

Risk management plays a central role in sequential decision-making problems, common in fields such as portfolio management [Lai et al., 2011], autonomous driving [Maurer et al., 2016], and healthcare [Parker, 2009]. A common risk-measure is the variance of the expected sum of rewards/costs and the mean-variance tradeoff function [Sobel, 1982; Mannor and Tsitsiklis, 2011] is one of the most widely used objective functions in risk-sensitive decision-making. Other risk-sensitive objectives have also been studied, for example, Borkar [2002] studied exponential utility functions, Tamar et al. [2012] experimented with the Sharpe Ratio measurement, Chow et al. [2018] studied value at risk (VaR) and mean-VaR optimization, Chow and Ghavamzadeh [2014], Tamar et al. [2015b], and Chow et al. [2018] investigated conditional value at risk (CVaR) and mean-CVaR optimization in a static setting, and Tamar et al. [2015a] investigated coherent risk for both linear and nonlinear system dynamics. Compared with other widely used performance measurements, such as the Sharpe Ratio and CVaR, the mean-variance measurement has explicit interpretability and computational advantages [Markowitz et al., 2000; Li and Ng, 2000]. For example, the Sharpe Ratio tends to lead to solutions with less mean return [Tamar et al., 2012].

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1809.02292

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology (0.54)
Health & Medicine (0.48)
Banking & Finance (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

New Optimisation Methods for Machine Learning

Defazio, Aaron

arXiv.org Machine LearningMar-18-2016

A thesis submitted for the degree of Doctor of Philosophy of The Australian National University. In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation. For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the "scale-free" property. For the structure learning problem, we establish SHORTCUT, a O(n^{2.5}) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Machine Learning

1510.02533

Country: North America > United States (1.00)

Genre: